Module - 4: Information Theory and Coding
On completion of this module, you will be able to:
Define information and discrete messages
Define amount of information
Explain Entropy and its properties
Define information rate
Explain mutual information
Explain properties of mutual information
Introduction to Information
- Information can take the form of electrical signals, words, digital data, pictures, music, etc.
- Communication systems are designed to transmit different types of information from a sender to a receiver through a medium.
- The nature of the information determines the transmission efficiency of the communication system.
- In radio broadcasting the information is audio i.e., analog in nature.
- In computers the outputs are discrete in nature. Hence these sources are called discrete sources.
- Example:
Fig. 1: Radio Broadcasting.
Uncertainty & Probability
- Uncertainty depends on our surprise at an event.
- For example, “The sun will rise tomorrow” is not surprising (P ≈ 1), but “The sun will not rise tomorrow” is very surprising (P << 1).
- So, uncertainty is inversely related to the probability of the event.
Uncertainty ∝ 1 / P(event)
Uncertainty = log(1 / P(event)) = −log P(event)
Introduction to Discrete Messages
- When a signal is quantized, the receiver needs to determine only the quantized level encountered at each sampling instant
- Discrete messages consist of those quantized levels, which can be transmitted directly or encoded into any one of a number of forms
- One computer sending a bit stream over a pulse-modulated line to another computer is an example of a discrete system
- A conversation between two people by means of sound waves is an example of a mixed system: the words in the message are discrete but the signal is continuous
Fig. 2: Discrete Sampled Signals.
Amount of Information
- Consider messages m1, m2, m3, m4, … transmitted by a communication system with probabilities of occurrence P1, P2, P3, P4, … respectively
- Then the amount of information carried by the message mk is given as,
Ik = log2 (1 / Pk)
- where, Pk = Probability of occurrence of mk
4.5 Unit of Information
- The unit of information is the bit when the base of the logarithm is 2.
- When the base is e, the unit of information is the nat.
- When the base is 10, the unit of information is the hartley (or decit).
Problem:
A message signal mk is transmitted by a transmitter. The probability of occurrence of this signal is 1/2. Calculate the information conveyed by it in bits, nats and decits.
Solution :
- Given that, Pk = 1/2
- Information in terms of bits ( log base 2 ).
Ik = log2 (1 / Pk)
Ik = log2 (2) = 1 bit
- Information in terms of nats ( log base e ).
Ik = loge (1 / Pk)
Ik = loge (2) = 0.693 nats
- Information in terms of decit ( log base 10 ).
Ik = log10 (1 / Pk)
Ik = log10 (2) = 0.301 decit
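These conversions can be checked numerically; the following is a minimal Python sketch (variable names are illustrative, not from the source).

```python
import math

p_k = 0.5  # probability of occurrence of the message m_k

# Amount of information in the three common units.
bits = math.log2(1 / p_k)          # base-2 logarithm -> bits
nats = math.log(1 / p_k)           # natural logarithm -> nats
hartleys = math.log10(1 / p_k)     # base-10 logarithm -> hartleys (decits)

print(f"{bits:.3f} bit, {nats:.3f} nat, {hartleys:.3f} hartley")
# Expected output: 1.000 bit, 0.693 nat, 0.301 hartley
```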
Introduction to Entropy
- The Entropy is defined as the average information per message. It is represented by the symbol H, and its units are bits/message
- The Entropy should be as high as possible in order to ensure maximum transfer of information .
- Consider M different and independent messages m1, m2, … with probabilities of occurrence P1, P2, …
- Suppose that during a long period of transmission a sequence of L messages is generated, with L very large
- Then, in the L-message sequence, P1·L messages of m1 are transmitted, P2·L messages of m2 are transmitted, etc.
- The total information in such a sequence will be.
Itotal = P1L log2(1/P1) + P2L log2(1/P2) + ..............
- The average information per message will be
H = (Itotal)/L = P1log2(1/P1) + P2 log2(1/P2) + ..............
H = ∑(k=1 to M) Pk·log2(1/Pk)
- This is the expression for Entropy
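The expression for H translates directly into a short helper; this is an illustrative Python sketch, not part of the source material.

```python
import math

def entropy(probs):
    """Average information per message, H = sum(Pk * log2(1/Pk)) in bits/message."""
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

# Two-message source with P = 1/4 and 3/4.
print(entropy([0.25, 0.75]))   # ~0.811 bits/message

# Equally likely messages give the maximum entropy log2(M).
print(entropy([0.25] * 4))     # 2.0 bits/message = log2(4)
```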
Entropy for different probability
- Let us consider two messages with probabilities P and (1−P). Then, the average information per message is
H = P·log2(1/P) + (1-P)·log2(1/(1-P))
- The maximum value of H occurs at P = 1/2, when the two messages are equally likely
- The corresponding H is
Hmax = (1/2).log2(2) + (1/2).log2(2)
Hmax = 1 bit/message
- So, when there are M messages, H becomes maximum when all the messages are equally likely. In this case each message has a probability p = 1/M and
Hmax = ∑ (1/M).log2M = log2M
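A quick numerical scan confirms that the binary entropy peaks at P = 1/2; the sketch below is illustrative only.

```python
import math

def binary_entropy(p):
    """H(p) = p*log2(1/p) + (1-p)*log2(1/(1-p)) in bits/message."""
    if p in (0.0, 1.0):
        return 0.0
    return p * math.log2(1 / p) + (1 - p) * math.log2(1 / (1 - p))

# Scan P from 0 to 1 and confirm the maximum sits at P = 1/2.
best_p = max((i / 100 for i in range(101)), key=binary_entropy)
print(best_p, binary_entropy(best_p))   # 0.5, 1.0 bit/message
```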
Property of Entropy
- Average information (Entropy) is zero if the event is certain or impossible i.e.,
H = 0 if Pk = 0 or 1
- When Pk = 1/M for all the M symbols, the symbols are equally likely. For such cases
Entropy is given as, H = log2M
- Upper bound on entropy is given as,
Hmax = log2M
Joint Entropy
- If two sources S1 and S2 deliver the messages xi and yj respectively, which are not necessarily mutually independent,
- Then the probability of their simultaneous occurrence is given by the joint probability P(xi, yj)
- The Entropy of the joint event (xi, yj) is known as the Joint Entropy and it is defined as,
H(X, Y) = ∑i ∑j P(xi, yj)·log2(1/P(xi, yj))
- If xi and yj are mutually independent, the joint probability factors as P(xi, yj) = P(xi)·P(yj), and the Joint Entropy becomes
H(X, Y) = H(X) + H(Y)
- where H(X) = Entropy of the source S1 and H(Y) = Entropy of the source S2
- Thus the Joint Entropy of mutually independent sources is equal to the sum of the individual entropies
Conditional Entropy
- The conditional Entropy H(Y/X) is defined as the Entropy of Y given that X has occurred
- The conditional Entropy is defined as,
H(Y/X) = ∑i ∑j P(xi, yj)·log2(1/P(yj/xi))
- If xi and yj are independent, then P(yj/xi) = P(yj)
- So, the expression for the conditional Entropy reduces to,
H(Y/X) = ∑j P(yj)·log2(1/P(yj)) = H(Y)
- Thus for independent messages xi and yj the conditional Entropy is equal to the marginal Entropy
Relationship between the conditional & joint Entropy
- The expression for the Joint Entropy can be expanded using P(xi, yj) = P(xi)·P(yj/xi):
H(X, Y) = ∑i ∑j P(xi, yj)·log2(1/P(xi)) + ∑i ∑j P(xi, yj)·log2(1/P(yj/xi))
H(X, Y) = H(X) + H(Y/X)
- Similarly, it can also be proved that
H(X, Y) = H(Y) + H(X/Y)
- These are the required relations between the conditional and Joint Entropies.
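The chain rule H(X, Y) = H(X) + H(Y/X) can be verified numerically for any joint distribution; the sketch below uses a small illustrative 2×2 joint probability matrix (not taken from the source).

```python
import math

def h(probs):
    """Entropy in bits of a list of probabilities."""
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

# Illustrative joint probability matrix P(x_i, y_j) (rows: x, columns: y).
P = [[0.25, 0.10],
     [0.15, 0.50]]

# Marginal distributions of X and Y.
Px = [sum(row) for row in P]
Py = [sum(P[i][j] for i in range(2)) for j in range(2)]

# Joint entropy H(X, Y).
H_xy = h([p for row in P for p in row])

# Conditional entropy H(Y/X) = sum_ij P(x_i, y_j) * log2(1 / P(y_j / x_i)).
H_y_given_x = sum(P[i][j] * math.log2(Px[i] / P[i][j])
                  for i in range(2) for j in range(2) if P[i][j] > 0)

# Verify the chain rule H(X, Y) = H(X) + H(Y/X).
print(round(H_xy, 6) == round(h(Px) + H_y_given_x, 6))   # True
```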
Rate of Information
- If the source generates messages at the rate r messages per second, then the information rate is defined as,
R = r. H
- Where, r = number of messages/sec, and
H = average information/message
- Units of information rate:
R = [r messages/second] × [H bits/message]
R = average information per second, expressed in bits/sec
Problem :
- An analog signal is bandlimited to 4 kHz and sampled at the Nyquist rate. The samples are quantized into 4 levels and each level represents one message.
The probabilities of occurrence of these levels are p1 = p4 = 1/8 and p2 = p3 = 3/8. Find the information rate of the source.
Solution :
- Given that fm = 4 kHz, therefore the sampling (message) rate is r = 2fm = 8000 messages/sec
- The average information per message is,
H = 2 × (1/8)·log2(8) + 2 × (3/8)·log2(8/3) = 0.75 + 1.06 ≈ 1.811 bits/message
- So, the information rate is,
R = r × H
= 8000 × 1.811
≈ 14,490 bits/sec
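The calculation above can be reproduced with a few lines of Python; this is an illustrative check only.

```python
import math

# Quantization-level probabilities from the problem.
probs = [1/8, 3/8, 3/8, 1/8]

# Average information per message (per sample).
H = sum(p * math.log2(1 / p) for p in probs)   # ~1.81 bits/message

# Messages per second = Nyquist sampling rate = 2 * 4 kHz = 8000 samples/s.
r = 8000

print(H, r * H)   # ~1.81 bits/message, ~14,490 bits/s
```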
Mutual Information
- Mutual information is defined as,
I(x; y) = log2[P(x/y) / P(x)]      …(1)
- Where, P(x/y) = the conditional probability of x given that y occurs, and P(x) = the probability of x
- Mutual information measures the amount of information transferred when x is transmitted and y is received
For an ideal noiseless channel
- P(x/y) = 1
So, from equation (1),
I(x; y) = log2[1/P(x)]
- So, the mutual information is the same as the self-information of x
If the channel noise is so large that y is totally unrelated to x, then
- P(x/y) = P(x)
So, from equation (1),
I(x; y) = log2(1) = 0
- That means no information is transferred
Average mutual information
- Average mutual information is defined as,
I(X; Y) = ∑x ∑y P(x, y)·log2[P(x/y) / P(x)]
- I(X; Y) represents the average source information gained per received symbol
Expression for mutual information
- Expanding the average mutual information,
I(X; Y) = ∑x ∑y P(x, y)·log2(1/P(x)) − ∑x ∑y P(x, y)·log2(1/P(x/y))
I(X; Y) = H(X) − H(X/Y)
- Thus the mutual information is equal to the difference between the source Entropy H(X) and the noise Entropy H(X/Y)
- If we interchange X and Y then,
I(X; Y) = H(Y) − H(Y/X)
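The equivalence between the defining sum and H(X) − H(X/Y) can be checked numerically; the joint distribution below is an illustrative example, not from the source.

```python
import math

def h(probs):
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

# Illustrative joint distribution P(x_i, y_j) for a noisy channel.
P = [[0.40, 0.10],
     [0.10, 0.40]]

Px = [sum(row) for row in P]
Py = [sum(P[i][j] for i in range(2)) for j in range(2)]

# Definition: I(X;Y) = sum_ij P(x,y) * log2( P(x,y) / (P(x) P(y)) ).
I_def = sum(P[i][j] * math.log2(P[i][j] / (Px[i] * Py[j]))
            for i in range(2) for j in range(2) if P[i][j] > 0)

# Equivalent form: I(X;Y) = H(X) - H(X/Y).
H_x_given_y = sum(P[i][j] * math.log2(Py[j] / P[i][j])
                  for i in range(2) for j in range(2) if P[i][j] > 0)
I_alt = h(Px) - H_x_given_y

print(round(I_def, 6) == round(I_alt, 6))   # True
```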
Source Coding
- Coding theory is the study of the properties of codes and their fitness for specific applications. Codes are used for data compression, cryptography, error correction and, more recently, also for network coding.
- There are two classes of codes -
1. Source coding (Data compression)
2. Channel coding (Forward error correction).
- Source coding is the process of encoding information using fewer bits than the uncoded version of the information.
- The knowledge of the statistics of the source is required in order to develop an efficient source code.
- In particular, short codewords can be assigned to frequent source symbols and long codewords to rare source symbols. Such a source code is called a variable-length code.
- For example: Zip data compression is used every day on the Internet to reduce the network load and make files smaller.
Channel Coding
- Channel coding refers to the class of signal transformation designed to improve communication performance.
- It enables the transmitted signals to better withstand the effects of various channel impairments.
- The aim of channel coding theory is to find codes which transmit quickly, contain many valid code words and can correct or at least detect many errors.
Discrete Memoryless Channel
- A discrete memoryless channel (DMC) is a statistical model with an input of X and output of Y, which is a noisy version of X (here both are random variables).
- In each time slot the channel accepts an input symbol X selected from a given alphabet and emits an output symbol Y.
- The channel is said to be discrete when both of the alphabets X and Y have finite sizes.
- It is said to be memoryless when the current output symbol depends only on the current input symbol and not on inputs before or after.
Code efficiency
- Consider a discrete memoryless source with Entropy H bits/message whose messages are encoded with an average of N binits per message.
- The code efficiency is then defined as η = H / N; it approaches 1 for an efficient code.
Forward Transition Matrix
- Consider a discrete memoryless channel with input alphabet X and output alphabet Y.
- A convenient way of describing a DMC is to arrange the transition probabilities of the channel in the form of a matrix.
- The (j, k) entry of the matrix is P(Y = yk | X = xj), which is called the forward transition probability.
- Each row of the channel matrix P corresponds to a fixed channel input, whereas each column of the matrix corresponds to a fixed channel output.
Binary symmetric channel (BSC)
- It is a special case of DMC with J = K = 2. The channel has two input symbols (x0=0, x1=1)and two output symbols (y0=0, y1=1).
- The channel is symmetric because the probability of receiving a 1 if a 0 is sent is the same as the probability of receiving a 0 if a 1 is sent.
Figure : Transition probability diagram of binary symmetric channel
- The conditional probability of error is denoted by p and the transition probability matrix is given by,
P = | 1−p    p  |
    |  p    1−p |
Channel Capacity
- Channel capacity, C, is defined as ‘the maximum mutual information I(X; Y) in any single use of the channel, where the maximization is over all possible input probability distributions’.
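For the BSC above, the capacity can be found by maximizing I(X; Y) over the input distribution; the following sketch does this by brute force and compares the result with the closed form C = 1 − H(p) (the crossover probability p = 0.1 is an assumed illustrative value).

```python
import math

def h(probs):
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

p = 0.1                       # BSC crossover (error) probability
T = [[1 - p, p],              # forward transition matrix P(y | x)
     [p, 1 - p]]

def mutual_information(q):
    """I(X;Y) for input distribution P(X=1) = q over the BSC above."""
    Px = [1 - q, q]
    # Joint distribution P(x, y) = P(x) * P(y | x).
    P = [[Px[i] * T[i][j] for j in range(2)] for i in range(2)]
    Py = [P[0][j] + P[1][j] for j in range(2)]
    return h(Px) + h(Py) - h([v for row in P for v in row])

# Capacity = maximum of I(X;Y) over all input distributions.
C = max(mutual_information(i / 1000) for i in range(1001))
print(C, 1 - h([p, 1 - p]))   # both ~0.531 bits/use; the maximum is at q = 0.5
```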
Shannon-Fano Algorithm
- If the messages are not equally likely, then H < N, i.e. (H/N) < 1, and each binit of a fixed-length code carries less than 1 bit of average information.
- This type of problem can be solved by assigning a code of different length to each message
- The number of bits in a codeword decreases as the message becomes more likely
- The Shannon-Fano algorithm uses this technique to get an efficient code
- Consider table given below which gives five possible messages a, b, c, d and e with their corresponding probability
Message | a | b | c | d | e |
Probability | 1/15 | 1/7 | 1/6 | 1/6 | 1/5 |
Following procedure is given below to obtain the code words for each message
- List the source symbols (messages) in the order of decreasing probability.
- Partition the set in to two sets that are as close to equiprobable as possible, and assign 0 to the upper set and 1 to the lower set.
- Continue this process, each time partitioning the sets with as nearly equal probabilities as possible, until further partitioning is not possible (a coded sketch of this procedure follows the problem below).
Fig. : Flow Chart
Problem :
- Find the codeword for each message in the table above. Also calculate the code efficiency.
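A minimal sketch of the Shannon-Fano procedure applied to the table above is given below; the function names and the tie-breaking rule for the split are illustrative choices, and the probabilities are renormalised for the entropy calculation because the table values do not sum exactly to one.

```python
import math

def shannon_fano(symbols):
    """symbols: list of (name, probability) in decreasing order of probability;
    returns {name: codeword}. A sketch of the recursive partitioning above."""
    if len(symbols) == 1:
        return {symbols[0][0]: ""}
    # Find the split that makes the two sets as close to equiprobable as possible.
    total = sum(p for _, p in symbols)
    best_i, best_diff, running = 1, float("inf"), 0.0
    for i in range(1, len(symbols)):
        running += symbols[i - 1][1]
        diff = abs(running - (total - running))
        if diff < best_diff:
            best_i, best_diff = i, diff
    upper, lower = symbols[:best_i], symbols[best_i:]
    codes = {}
    for prefix, group in (("0", upper), ("1", lower)):
        for name, code in shannon_fano(group).items():
            codes[name] = prefix + code
    return codes

# Messages listed in order of decreasing probability (from the table above).
msgs = [("e", 1/5), ("c", 1/6), ("d", 1/6), ("b", 1/7), ("a", 1/15)]
codes = shannon_fano(msgs)
print(codes)

# Code efficiency = H / average codeword length, with the probabilities
# renormalised so that they sum to one.
total = sum(p for _, p in msgs)
probs = {name: p / total for name, p in msgs}
H = sum(p * math.log2(1 / p) for p in probs.values())
L = sum(probs[name] * len(code) for name, code in codes.items())
print(f"H = {H:.3f} bits, L = {L:.3f} bits, efficiency = {H / L:.1%}")
```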
Prefix Coding
- A prefix code is a variable-length source coding scheme in which no codeword is the prefix of any other codeword.
- The prefix code is a uniquely decodable code.
- But, the converse is not true i.e., all uniquely decodable codes may not be prefix codes.
Symbol | Prob. of Occurrence | Code I | Code II | Code III |
S0 | 0.5 | 0 | 0 | 0 |
S1 | 0.25 | 1 | 10 | 01 |
S2 | 0.125 | 00 | 110 | 011 |
S3 | 0.125 | 11 | 111 | 0111 |
- The definition of prefix code is illustrated above in the table.
- From the table, code I is not a prefix code. Code II is a prefix code. Code III is also uniquely decodable but not a prefix code.
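The prefix property is easy to test mechanically; the sketch below checks the three codes from the table (illustrative helper, not from the source).

```python
def is_prefix_code(codes):
    """True if no codeword is a prefix of another (a sufficient condition for
    unique decodability)."""
    return not any(a != b and b.startswith(a) for a in codes for b in codes)

code_I   = ["0", "1", "00", "11"]
code_II  = ["0", "10", "110", "111"]
code_III = ["0", "01", "011", "0111"]

print(is_prefix_code(code_I), is_prefix_code(code_II), is_prefix_code(code_III))
# False True False -- matching the discussion above
```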
Huffman Coding
- Huffman developed a nice greedy algorithm for producing a minimum- cost (optimum) prefix code
- The purpose of the Huffman code is to construct minimum-redundancy codes
Procedure
- List the source symbols in order of decreasing probability
- The two source symbols of lowest probability are assigned 0 and 1
- These two source symbols are combined into a new message
- The probability of this new message is equal to the sum of the probabilities of the two original symbols
- The probability of this new message is placed in the list according to its value
- Follow this procedure until only two messages remain and assign 0 and 1 to them (a coded sketch follows the worked example below)
Applications
- Both the .mp3 and .jpg file formats use Huffman coding at one stage of the compression
- An alternative method that achieves higher compression but is slower was patented by IBM, making Huffman codes attractive
- File compression: JPEG images, MPEG movies
- Transmission of data over band-limited channels: Modem data compression
- Video cards, sound cards
Example : A discrete memoryless source has five symbols with the respective probabilities given below. Find the Huffman code
Ax = { a, b, c, d, e }
Px = { 0.25, 0.25, 0.2, 0.15, 0.15 }
Solution:
Message | Probability | Code | Code Length |
a | 0.25 | 00 | 2 |
b | 0.25 | 10 | 2 |
c | 0.2 | 11 | 2 |
d | 0.15 | 010 | 3 |
e | 0.15 | 011 | 3 |
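A compact way to build such a code is a heap-based greedy merge; the sketch below is illustrative, and because ties between equal probabilities can be broken either way, the codewords may differ from the table above while the code lengths (and hence the average length of 2.3 bits/symbol) stay the same.

```python
import heapq
import itertools

def huffman(prob_map):
    """Build a binary Huffman code for {symbol: probability}; returns
    {symbol: codeword}."""
    counter = itertools.count()             # tie-breaker so heapq never compares dicts
    heap = [(p, next(counter), {s: ""}) for s, p in prob_map.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p0, _, c0 = heapq.heappop(heap)     # two least probable nodes
        p1, _, c1 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c0.items()}
        merged.update({s: "1" + c for s, c in c1.items()})
        heapq.heappush(heap, (p0 + p1, next(counter), merged))
    return heap[0][2]

probs = {"a": 0.25, "b": 0.25, "c": 0.2, "d": 0.15, "e": 0.15}
codes = huffman(probs)
avg_len = sum(probs[s] * len(c) for s, c in codes.items())
print(codes, avg_len)   # average length 2.3 bits/symbol, as in the table
```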
Shannon's theorem on channel capacity
- A given communication system has a maximum rate of information C known as the channel capacity
- Theorem: Given a source of M equally likely messages, with M >> 1, which is generating information at a rate R, and given a channel with channel capacity C, then if
R ≤ C
there exists a coding technique such that the output of the source may be transmitted over the channel with a probability of error in the received message that may be made arbitrarily small
Shannon-hartley theorem
- This theorem is complementary to Shannon's theorem. It applies to a channel in which the noise is Gaussian
- Theorem: The channel capacity of a white, bandlimited Gaussian channel is given by,
C = B·log2(1 + S/N) bits/sec
- where B is the channel bandwidth(BW), S is the signal power and N is the total noise within the channel BW
Trade-off between bandwidth and signal to noise ratio (SNR)
- By Shannon-Hartley theorem channel capacity of the Gaussian channel is given as
C = B·log2(1 + S/N) bits/sec
- From the above equation channel capacity depends on two factors
1. Bandwidth(B) of the channel
2. Signal to noise ratio (S/N)
A noiseless channel has infinite capacity
- If the noise power N → 0, then S/N → ∞ and the channel capacity C → ∞
Infinite BW does not give infinite capacity
- If the BW is infinite, the channel capacity remains finite, because as the BW increases the noise power also increases
- The noise power is given as N = N0·B, where N0 is the noise power spectral density
- Hence C = B·log2(1 + S/(N0·B)), and as B → ∞ the capacity approaches the finite limit
C∞ = (S/N0)·log2 e ≈ 1.44·(S/N0)
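The saturation of capacity with bandwidth can be seen numerically; the signal power and noise density below are assumed illustrative values, not from the source.

```python
import math

S = 1e-4        # signal power in watts (illustrative value)
N0 = 1e-9       # one-sided noise power spectral density in W/Hz (illustrative)

def capacity(B):
    """Shannon-Hartley capacity with noise power N = N0 * B."""
    return B * math.log2(1 + S / (N0 * B))

for B in (1e3, 1e4, 1e5, 1e6, 1e7):
    print(f"B = {B:.0e} Hz  ->  C = {capacity(B):.0f} bits/s")

# Limiting value as B -> infinity: C_inf = 1.44 * S / N0.
print(f"C_inf = {1.44 * S / N0:.0f} bits/s")
```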
Example :
- A Gaussian channel has a 1 MHz bandwidth. Calculate the channel capacity if the signal power to noise spectral density ratio S/N0 is 10^5 Hz. Also find the maximum information rate.
Solution :
- S/N = (S/N0)/B = 10^5/10^6 = 0.1
- C = B·log2(1 + S/N) = 10^6 × log2(1.1) ≈ 1.375 × 10^5 bits/sec
- The maximum information rate (B → ∞) is C∞ = 1.44·(S/N0) = 1.44 × 10^5 bits/sec
Example :
- Consider an AWGN channel with 4 kHz bandwidth and noise power spectral density N0/2 = 10^-12 W/Hz. The signal power required at the receiver is 0.1 mW. Calculate the capacity of this channel.
Solution :
- N = N0·B = 2 × 10^-12 × 4000 = 8 × 10^-9 W, so S/N = 10^-4 / (8 × 10^-9) = 1.25 × 10^4
- C = B·log2(1 + S/N) = 4000 × log2(1 + 12500) ≈ 5.44 × 10^4 bits/sec
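Both worked examples can be checked with a few lines of Python (illustrative sketch only).

```python
import math

# Example 1: B = 1 MHz, S/N0 = 1e5 Hz.
B1 = 1e6
snr1 = 1e5 / B1                          # S/N = (S/N0) / B
C1 = B1 * math.log2(1 + snr1)            # ~1.375e5 bits/s
C1_max = 1.44 * 1e5                      # B -> infinity limit, 1.44e5 bits/s

# Example 2: B = 4 kHz, N0/2 = 1e-12 W/Hz, S = 0.1 mW.
B2, N0, S = 4e3, 2e-12, 1e-4
C2 = B2 * math.log2(1 + S / (N0 * B2))   # ~5.44e4 bits/s

print(f"{C1:.4g}  {C1_max:.4g}  {C2:.4g}")
```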
Error Detection & Error Correction Codes
Need for error control coding
- In telecommunications, the detection and correction of errors is important for maintaining data integrity on noisy communication channels
- When digital signals (groups of 0s and 1s) are transmitted from one system to another, errors may occur during transmission
- The change of a 1 to a 0, or of a 0 to a 1, during transmission is known as an error
- It is necessary to detect and correct the error to obtain a correct message
- Error detection techniques detect the error while error correction technique reconstructs the original data.
- To detect or correct errors, need to send extra (redundant) bits with data.
- Error-correcting codes enable the receiver, acting alone, both to detect the presence of an error in the received data and to reconstruct the data in its original form
- Error correcting codes are divided in to two types
1. Block codes
2. Convolutional codes
- The major difference between these two classes is that block codes do not need memory, while convolutional codes require memory
Linear Block Code
- Linear block codes are so named because each code word in the set is a linear combination of a set of generator code words
- Block codes are codes in which k bits are input, called data words (messages), and n bits are output, called code words
- To generate an (n, k) block code, the channel encoder accepts information in successive k-bit blocks
- For each block, (n−k) redundant bits are added to produce an encoded block of n bits, called a codeword
Fig. : Block Code.
- Linear block codes are very easy to implement in hardware, and since they are algebraically determined, they can be decoded in constant time.
- Block code is linear if and only if the modulo-2 sum of two code words is also a code word
- Sum, difference and scalar multiples of codeword are also a codeword
Encoding of (n, k) linear block code
- In the encoding process the codeword is generated by multiplying the data word by the generator matrix (modulo-2): C = D·G
Systematic Block Code (n, k)
- For a systematic code, the first (or last) k elements in the codeword are the information bits; the generator matrix then has the form G = [Ik | P], where P is the parity submatrix
Decoding of (n, k) Linear Block Code
- Received bits (R) = transmitted codeword (C) + error pattern (e)
- Compute the syndrome vector S = R·H^T (modulo-2), where H is the parity-check matrix; S depends only on the error pattern e and locates the position of the error in the received bits
- Complement the errored bit and extract the estimated message from the corrected codeword
Fig. : Error Detection in Block Coding.
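A minimal sketch of systematic encoding and syndrome decoding is given below for a (7, 4) code; the particular generator and parity-check matrices are illustrative choices (any pair satisfying G·H^T = 0 works the same way) and are not taken from the source.

```python
G = [  # generator matrix [I_4 | P], codeword c = d.G (mod 2)
    [1, 0, 0, 0, 1, 1, 0],
    [0, 1, 0, 0, 1, 0, 1],
    [0, 0, 1, 0, 0, 1, 1],
    [0, 0, 0, 1, 1, 1, 1],
]
H = [  # parity-check matrix [P^T | I_3], used to compute the syndrome
    [1, 1, 0, 1, 1, 0, 0],
    [1, 0, 1, 1, 0, 1, 0],
    [0, 1, 1, 1, 0, 0, 1],
]

def encode(d):
    """Codeword c = d.G mod 2; the first k bits are the message (systematic)."""
    return [sum(d[i] * G[i][j] for i in range(4)) % 2 for j in range(7)]

def syndrome(r):
    """Syndrome s = r.H^T mod 2; all-zero means no detectable error."""
    return tuple(sum(r[j] * H[i][j] for j in range(7)) % 2 for i in range(3))

# Syndrome table: each single-bit error pattern has a unique syndrome.
table = {syndrome([1 if j == pos else 0 for j in range(7)]): pos for pos in range(7)}

d = [1, 0, 1, 1]
c = encode(d)
r = c[:]
r[5] ^= 1                        # introduce a single-bit error in position 5
s = syndrome(r)
if s != (0, 0, 0):
    r[table[s]] ^= 1             # complement the errored bit
print(r == c, r[:4] == d)        # True True: codeword and message recovered
```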
Linear block codes applications
- Linear codes are applied in methods of transmitting symbols (e.g., bits) on a communications channel
- If errors occur in the communication, some errors can be detected by the recipient of a message block.
- Linear block codes are used in many cryptosystems
- It is very useful in situations where the BER of the channel is relatively low and bandwidth availability is limited.
- It is used for high-speed computer memory
1. In high speed memory, bandwidth is limited because the cost per bit is relatively high compared to low-speed memory like disks
2. The error rates are usually low and errors tend to occur by the byte, so a SEC/DED (single-error-correcting/double-error-detecting) coding scheme for each byte provides sufficient error protection
3. Error coding must be fast in this situation because high throughput is desired
Hamming code
Hamming distance
- The Hamming distance between two code words is defined as the number of locations in which their respective elements differ.
- For example, the code words 1 0 1 1 0 and 1 1 0 1 0 differ in two positions, so the Hamming distance between them is 2.
Hamming weight
- The Hamming weight of a code word is defined as the number of nonzero elements in the code word.
- For example: Consider the codeword c =1 1 0 1 0 1 0 0 the number of non zero elements is 4 hence the hamming weight is 4.
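Both quantities are one-liners in code; the example code words below are illustrative.

```python
def hamming_distance(c1, c2):
    """Number of positions in which the two (equal-length) code words differ."""
    return sum(a != b for a, b in zip(c1, c2))

def hamming_weight(c):
    """Number of nonzero elements in the code word."""
    return sum(bit != 0 for bit in c)

print(hamming_distance([1, 0, 1, 1, 0], [1, 1, 0, 1, 0]))      # 2
print(hamming_weight([1, 1, 0, 1, 0, 1, 0, 0]))                # 4, as above
```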
Convolution Code
- Convolutional codes were introduced in 1955 by Elias as an alternative to Block codes. Convolutional codes are different from block codes by the existence of memory in the encoding scheme.
- Convolutional coding is one of the most frequently used error correction techniques in digital communication. These codes can be used for correcting random errors, burst errors or both. Convolutional codes are also known as recurrent codes.
- In convolutional codes the message bit stream is encoded in a continuous manner rather than in blocks as in block codes.
- In convolutional codes, each block of k bits is mapped into a block of n bits.
- These n bits are not only determined by the present k message bits but also by the previous message bits.
- It works like a state machine, where output is determined by the input and the state of the machine.
- They are generated using shift registers.
- An (n, k, m) convolution code has
k --> number of input bits
n --> number of output bits
m --> number of memory elements (flip flops) in register
Code Rate
- The quantity R is called the code rate of the encoder and is defined by
R = k / n
- Example 1:
➢ Number of message bits k = 1
➢ Number of encoded bits n = 2
➢ Rate ½ means for each one input bit, the encoder provides 2 output bits. The encoder operates on one bit at a time.
➢ Suppose input sequence = 10110, then total number of encoded output bits = 5 x 2 = 10.
- Example 2:
➢ Number of message bits k = 1
➢ Number of encoded bits n = 3
➢ Rate 1/3 means for each one input bit, the encoder provides 3 output bits. The encoder operates on one bit at a time.
➢ Suppose input sequence = 10110, then total number of encoded output bits = 5 x 3 = 15.
- Example 3:
➢ Number of message bits k = 2 (two input bits are processed at a time)
➢ Number of encoded bits n = 3 (each group of 2 inputs are transformed into 3 bits)
➢ Rate 2/3 means for each group of 2 input bits, the encoder provides 3 output bits. The encoder operates on 2 bits at a time.
➢ Suppose input sequence = 1011, then total number of encoded output bits = 3 + 3 = 6.
Constraint Length
- The constraint length is defined as the number of shifts over which a single message bit can influence the encoder output
- The quantity L is called the constraint length of the code and is defined by
Constraint length, L = n (m+1)
where, m = number of memory registers
n = number of encoded bits per message bits
- It is measured in terms of encoded output bits
Code Dimension
- The dimension of the code depends on n, k and m
where n represents the number of encoded bits per message bit, k is the number of message bits taken at a time by the encoder and m is the encoder's memory
- The code dimension is therefore represented by (n, k, m )
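As an illustration of these definitions, the sketch below implements a rate-1/2 (n = 2, k = 1, m = 2) convolutional encoder with a shift register; the generator connections (1 + D + D², 1 + D²) are an assumed common choice, not taken from the source.

```python
def conv_encode(bits):
    """Encode a bit list; each input bit produces n = 2 output bits that depend
    on the current bit and the m = 2 previous bits held in the register."""
    s1 = s2 = 0                       # the two memory elements (flip-flops)
    out = []
    for b in bits:
        out.append(b ^ s1 ^ s2)       # first output: 1 + D + D^2 connection
        out.append(b ^ s2)            # second output: 1 + D^2 connection
        s1, s2 = b, s1                # shift the register
    return out

msg = [1, 0, 1, 1, 0]
print(conv_encode(msg), len(conv_encode(msg)))   # 5 input bits -> 10 output bits
```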
Advantages of convolution codes
- The decoding delay of these codes is short, since convolutional codes operate on smaller blocks of data
- Convolutional codes require a smaller amount of storage
- Loss of synchronization is not as serious a problem as it is in block-coded systems
Disadvantages of convolution codes
- Complex implementation
- Their theory and analysis are less developed than those of block codes
- Analyzing the performance is very difficult
Applications of convolution codes
- Widely used in satellite communications
- It is used in digital speech (the required BER is 10^-5)
- It is used in digital cellular mobile communications and universal mobile telecommunication systems in order to achieve coding gain at moderate code rates
- It is used in GSM systems